Subcategory of Machine Learning. Consists of using large NN models (i.e. models with a high number of layers) to solve complex problems.
Structure of a NN model
Basic Elements
Neuron
Takes multiple inputs, computes their weighted sum (usually plus a bias term) and passes the result as output.
Layer
Set of similar neurons taking different inputs and/or having different weights.
Neural Network (NN)
Sequence of layers.
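As a minimal illustration of these three building blocks, here is a plain-Python sketch (all weights, biases and inputs are made-up values): a neuron is a weighted sum, a layer applies several neurons to the same inputs, and a network chains layers.

```python
def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, plus a bias term
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def layer(inputs, weights_per_neuron, biases):
    # Each neuron of the layer sees the same inputs but has its own weights
    return [neuron(inputs, w, b) for w, b in zip(weights_per_neuron, biases)]

# A tiny "network": two layers applied in sequence
x = [1.0, 2.0]
h = layer(x, [[0.5, -1.0], [1.0, 1.0]], [0.0, 0.5])  # first layer: 2 inputs -> 2 outputs
y = layer(h, [[1.0, 1.0]], [0.0])                    # second layer: 2 inputs -> 1 output
print(h)  # [-1.5, 3.5]
print(y)  # [2.0]
```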
Linear Functions
Linear Function
Function that can be written like this: \[
f(\alpha_1, \cdots, \alpha_n) = (\beta_{1,1} \alpha_1 + \cdots + \beta_{1,n} \alpha_n, \cdots, \beta_{m,1} \alpha_1 + \cdots + \beta_{m,n} \alpha_n)
\]
Composition of Linear Functions
The composition of any number of linear functions is itself a linear function. As a consequence, stacking linear layers with nothing in between adds no expressive power to the model.
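A quick numeric check of this property (with made-up 2×2 matrices): composing two linear maps gives the same result as applying the single linear map defined by their matrix product.

```python
def linear(matrix, v):
    # Apply the linear map given by its matrix to a vector
    return [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]

def matmul(a, b):
    # Matrix product: the matrix of the composed linear map
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

A = [[1.0, 2.0], [0.0, 1.0]]
B = [[3.0, 0.0], [1.0, 1.0]]
v = [1.0, 2.0]

composed = linear(A, linear(B, v))  # f(g(v))
single = linear(matmul(A, B), v)    # (A.B)(v)
print(composed, single)  # both [9.0, 3.0]
```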
Activation Functions - Definition
Activation Function
Function applied to the output of a NN layer (i.e. to the output of each of its neurons) to introduce non-linearity into the model.
Activation functions allow the network to approximate much more complex functions, using a sequence of intertwined affine layers and activation layers.
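For instance, ReLU (one of the most common activation functions) simply zeroes out negative values; applying it between linear layers is what makes the overall function non-linear:

```python
def relu(values):
    # ReLU: keep positive values, replace negative ones with 0
    return [max(0.0, v) for v in values]

print(relu([-2.0, -0.5, 0.0, 1.5, 3.0]))  # [0.0, 0.0, 0.0, 1.5, 3.0]

# Non-linearity check: relu(a + b) is not always relu(a) + relu(b)
a, b = [-1.0], [2.0]
print(relu([a[0] + b[0]]))        # [1.0]
print([relu(a)[0] + relu(b)[0]])  # [2.0]
```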
Recurrent
A type of layer designed to process sequential data such as text, time series, speech or audio. Works by combining the input data with the state of the previous time step.
Nowadays, however, transformer architectures are preferred for processing sequential data.
Pooling
A type of layer used to reduce the number of features by merging multiple features into one. There are multiple kinds of pooling layers, the simplest being Maximum (Max) Pooling and Average Pooling.
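A sketch of these two simplest variants on a 1D feature list, using a window of size 2 (the window size is an arbitrary choice here):

```python
def max_pool(features, window):
    # Keep only the maximum of each window
    return [max(features[i:i + window]) for i in range(0, len(features), window)]

def avg_pool(features, window):
    # Replace each window by its average
    return [sum(features[i:i + window]) / window for i in range(0, len(features), window)]

features = [1.0, 3.0, 2.0, 8.0, 4.0, 0.0]
print(max_pool(features, 2))  # [3.0, 8.0, 4.0]
print(avg_pool(features, 2))  # [2.0, 5.0, 2.0]
```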
A Residual Block aims at stabilizing the training and convergence of deep neural networks (those with a large number of layers) by adding the input of a given layer to the output of another layer further down in the architecture.
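The skip connection itself is just an addition. A minimal sketch, where `f` is a made-up stand-in for the layers inside the block:

```python
def f(x):
    # Stand-in for the layers inside the block (made-up transformation)
    return [0.1 * v for v in x]

def residual_block(x):
    # Output = inner transformation + the unchanged input (skip connection)
    return [fx + xi for fx, xi in zip(f(x), x)]

x = [1.0, -2.0]
print(residual_block(x))  # [1.1, -2.2]
```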
Attention aims at determining the relative importance of each part of the input in order to make better predictions. It is widely used in natural language processing (NLP) and image processing.
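At its core, attention computes importance scores, normalizes them with a softmax, and uses the result to weight the different parts of the input. A toy sketch on scalar scores (the scores and values are made-up numbers, not a full transformer attention):

```python
import math

def softmax(scores):
    # Turn arbitrary scores into positive weights that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up relevance scores for three parts of an input, and their values
scores = [2.0, 0.0, 1.0]
values = [10.0, 20.0, 30.0]

weights = softmax(scores)
output = sum(w * v for w, v in zip(weights, values))  # weighted combination
print(weights)  # highest weight on the part with the highest score
print(output)
```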
Example of fully connected network made with NN-SVG
PyTorch
import torch
import torch.nn as nn

# Simple fully connected model with 2 hidden layers
class SimpleMLP(nn.Module):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(2, 20)   # Input layer to 1st hidden layer
        self.fc2 = nn.Linear(20, 10)  # 1st hidden layer to 2nd hidden layer
        self.fc3 = nn.Linear(10, 1)   # 2nd hidden layer to output layer

    def forward(self, x):
        x = torch.relu(self.fc1(x))  # ReLU activation after first layer
        x = torch.relu(self.fc2(x))  # ReLU activation after second layer
        x = self.fc3(x)              # Output layer (no activation for regression tasks)
        return x
Tensorflow + Keras
import tensorflow as tf
from tensorflow.keras import layers, Model

# Simple fully connected model with 2 hidden layers
class SimpleMLP(Model):
    def __init__(self):
        super(SimpleMLP, self).__init__()
        self.fc1 = layers.Dense(20, activation='relu')  # Input layer to 1st hidden layer
        self.fc2 = layers.Dense(10, activation='relu')  # 1st hidden layer to 2nd hidden layer
        self.fc3 = layers.Dense(1)  # 2nd hidden layer to output layer (no activation for regression)

    def call(self, inputs):
        x = self.fc1(inputs)
        x = self.fc2(x)
        return self.fc3(x)
Optimization
Why Optimization?
The number of possible combinations of parameters is huge, even with small NN models. To (hopefully) find the best possible combination, we need two things:
A way to evaluate any combination
A way to find well-performing combinations
Loss Function
Definition
A loss function is a mathematical function that quantifies the difference between the network’s predicted output and the actual target values. The goal during training is to minimize this loss by adjusting the model’s weights, using gradient descent.
The most common loss functions are:
Mean Squared Error (MSE) for regression
Cross-entropy Loss for classification
Differentiable
To be able to perform gradient descent, the loss function must be differentiable, which roughly means continuous (no jumps) and smooth (no sudden changes of direction).
To get the best results when performing gradient descent, it is also better if the function is convex. The simplest definition of convexity is that if you draw a straight line segment between any two points on the curve, the curve lies below the segment.
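This "curve below the chord" definition can be checked numerically, e.g. for f(x) = x² (convex) versus f(x) = sin(x) (not convex on the chosen interval):

```python
import math

def midpoint_below_chord(f, a, b):
    # Convexity test at the midpoint: f((a+b)/2) must not exceed the chord's midpoint
    return f((a + b) / 2) <= (f(a) + f(b)) / 2

print(midpoint_below_chord(lambda x: x * x, -1.0, 3.0))  # True: x^2 is convex
print(midpoint_below_chord(math.sin, 1.0, 2.0))          # False: sin is not convex there
```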
Gradient Descent
The process of iteratively computing the gradient of a function and following the opposite direction to (hopefully) reach the minimum value of the function (if it exists).
Gradient Descent works because at any point of the function's domain, the gradient points in the direction of steepest ascent. So locally, following the opposite direction is the quickest way to reach lower values of the function. If we come back to the requirements listed before:
Differentiable functions are functions whose gradient exists everywhere
Convex functions are convenient for gradient descent because they have a single minimum, so steadily going down the function will always lead to that minimum.
Algorithm
Gradient Descent boils down to iteratively:
Compute the gradient of the loss function at the current point
Take a step in the opposite direction of the gradient, reaching a new point
Repeat from step 1 until the stop condition is met
In this process, the three things that have to be defined are:
The starting point (weights initialization)
The size of the steps (learning rate)
The condition to stop
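The loop above can be sketched on a toy convex function, f(x) = (x − 3)², whose minimum is at x = 3 (the starting point, learning rate and step count below are arbitrary choices):

```python
def gradient(x):
    # Derivative of f(x) = (x - 3)^2
    return 2 * (x - 3)

x = 0.0              # starting point (weights initialization)
learning_rate = 0.1  # scaling factor applied to the gradient
for step in range(100):  # stop condition: a fixed number of steps
    x = x - learning_rate * gradient(x)  # step in the opposite direction of the gradient

print(x)  # close to 3.0, the minimum of f
```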
Weights initialization
The starting point is defined by the first output of the model, and therefore by the initial values of the weights of the model. There are numerous methods to initialize the weights, but the most common one is to randomly draw them from a centered and normalized Gaussian distribution.
Learning rate
The gradient gives us a direction and a norm, but this norm is arbitrary and has to be rescaled by what we call the learning rate. The learning rate doesn't define the size of the steps directly: it is the scaling factor applied to the gradient's norm, which means that the norm itself still plays a crucial role.
The choice of the learning rate is crucial to hopefully converge quickly to the global minimum loss.
Example of gradient descent on the same function with different learning rates[15]
Stop condition
The stop condition determines when you decide to stop the algorithm. An easy solution is to choose a number of steps before launching the algorithm, but this will either imply useless computations after the algorithm has reached a final point, or stopping too early and not getting the best results possible.
Therefore, although there are more complex methods, the most common and simple process is to monitor the value of the loss, memorize the lowest value ever reached, and stop when there has been a given number of steps without any improvement to the best value. Then, we usually keep the model weights corresponding to this best value.
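The monitoring logic described above (often called early stopping with "patience") can be sketched as follows, on a made-up sequence of loss values:

```python
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.70, 0.71, 0.72, 0.73]  # made-up loss per step
patience = 3  # number of steps allowed without improvement

best_loss = float("inf")
best_step = 0
steps_without_improvement = 0

for step, loss in enumerate(losses):
    if loss < best_loss:
        best_loss = loss  # memorize the lowest value ever reached
        best_step = step  # (in practice, also save the model weights here)
        steps_without_improvement = 0
    else:
        steps_without_improvement += 1
        if steps_without_improvement >= patience:
            break  # stop: no improvement for `patience` steps

print(best_step, best_loss)  # step 5, loss 0.69
```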
Gradient Descent is beautiful, but right now, we only know in which direction (the opposite of the gradient) the output of the model should move. To transmit this information to the weights of the layers of the model, we use backpropagation.
Definition
Backpropagation
The process of computing the gradient of the loss with respect to the weights of each layer of the model and modifying them accordingly. The name comes from the fact that the process starts with the last layer of the model and propagates incrementally back to the first layer.
How?
Backpropagation involves a lot of partial derivative computations, which are individually not difficult but very tedious. Fortunately, NN libraries handle backpropagation automatically through a single function call, so there is no need to worry about it.
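To get an idea of what the library computes, here is the chain rule applied by hand to a one-neuron "network" y = w·x with loss L = (y − t)², checked against a numerical estimate (all values are made-up):

```python
# Forward pass
w, x, t = 2.0, 3.0, 5.0
y = w * x            # prediction
loss = (y - t) ** 2  # squared error

# Backward pass by hand (chain rule): dL/dw = dL/dy * dy/dw
dL_dy = 2 * (y - t)  # derivative of (y - t)^2 with respect to y
dy_dw = x            # derivative of w * x with respect to w
dL_dw = dL_dy * dy_dw
print(dL_dw)  # 6.0 = 2 * (6 - 5) * 3

# Numerical check with a small perturbation of w
eps = 1e-6
numerical = (((w + eps) * x - t) ** 2 - loss) / eps
print(numerical)  # close to 6.0
```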
Examples
PyTorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Generate some simple data
X = torch.randn(100, 2)  # 100 samples, 2 features
y = (2 * X[:, 0]**3 + 0.5 * X[:, 1]**2 + torch.randn(100) * 0.5).unsqueeze(1)

# Create a DataLoader for batching
batch_size = 32
dataloader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

# Instantiate the model
model = SimpleMLP()  # A simple model that is assumed to be defined before

# Define a loss function and optimizer
criterion = nn.MSELoss()  # Mean Squared Error Loss
optimizer = optim.SGD(model.parameters(), lr=0.01)  # Stochastic Gradient Descent

# Training loop
epochs = 1000
epoch_losses = []  # List to store average loss at each epoch
for epoch in range(epochs):
    epoch_loss = 0.0  # Accumulate loss for this epoch
    for batch_X, batch_y in dataloader:
        # Forward pass
        predictions = model(batch_X)  # Compute the model's predictions
        batch_loss = criterion(predictions, batch_y)  # Compare predictions to true values

        # Backward pass and optimization
        optimizer.zero_grad()  # Reset gradients
        batch_loss.backward()  # Backward pass
        optimizer.step()       # Update parameters

        # Accumulate weighted loss for the batch
        epoch_loss += batch_loss.item() * batch_X.size(0)

    epoch_loss /= len(dataloader.dataset)  # Normalize by dataset size to get average epoch loss
    epoch_losses.append(epoch_loss)  # Store the epoch's average loss

    # Print loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1:>4}/{epochs}, Loss: {epoch_loss:.4f}")
Tensorflow + Keras
import tensorflow as tf
from tensorflow.keras import layers, Model
import numpy as np

# Generate some simple data
X = np.random.randn(100, 2).astype(np.float32)  # 100 samples, 2 features
y = (2 * X[:, 0]**3 + 0.5 * X[:, 1]**2 + np.random.randn(100) * 0.5).astype(np.float32)
y = y.reshape(-1, 1)  # Reshape y to (100, 1)

# Create a Dataset for batching
batch_size = 32
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(batch_size)

# Instantiate the model
model = SimpleMLP()  # A simple model that is assumed to be defined before

# Define a loss function and optimizer
loss_fn = tf.keras.losses.MeanSquaredError()  # Mean Squared Error Loss
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)  # Stochastic Gradient Descent

# Training loop
epochs = 1000
epoch_losses = []  # List to store average loss at each epoch
for epoch in range(epochs):
    epoch_loss = 0.0  # Accumulate loss for this epoch
    for batch_X, batch_y in dataset:
        # Forward pass
        with tf.GradientTape() as tape:
            predictions = model(batch_X)  # Compute the model's predictions
            batch_loss = loss_fn(batch_y, predictions)  # Compare predictions to true values

        # Backward pass and optimization
        gradients = tape.gradient(batch_loss, model.trainable_variables)  # Backward pass
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))  # Update parameters

        # Accumulate weighted loss for the batch
        epoch_loss += batch_loss.numpy() * len(batch_X)

    epoch_loss /= len(X)  # Normalize by dataset size to get average epoch loss
    epoch_losses.append(epoch_loss)  # Store the epoch's average loss

    # Print loss every 100 epochs
    if (epoch + 1) % 100 == 0:
        print(f"Epoch {epoch + 1:>4}/{epochs}, Loss: {epoch_loss:.4f}")
Transfer Learning
Introduction
Principle
Start Closer to the Goal
The basic idea of Transfer Learning is to use a model that was pre-trained on a similar task. This initial model should have learnt basic knowledge that is common to its initial task and our new task, making it capable of learning the new task faster.
Reasons
Limited amount of data for the new task
New task is similar to old task
Training a model is very costly
Better final performance with less overfitting
Applications
Two major applications:
Computer Vision
Natural Language Processing
Categories
Fine-tuning
Idea
Take a pre-trained model as is, freeze some of the layers (usually the first layers) and continue the training where it was stopped with our new dataset.
Challenges
Find a good pre-trained model
Have a proper dataset (even though fine-tuning works with smaller datasets)
Freeze the right number of layers (more if the tasks are similar)
Train the model properly (useful to know how it was pre-trained)
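With PyTorch, freezing layers amounts to disabling gradient computation on their parameters. A minimal sketch using a small `nn.Sequential` model as a stand-in for a pre-trained network (the architecture and the choice of which layer to freeze are made-up for illustration):

```python
import torch.nn as nn

# Stand-in for a pre-trained model (in practice, load real pre-trained weights)
model = nn.Sequential(
    nn.Linear(10, 20),  # early layer: generic features, usually frozen
    nn.ReLU(),
    nn.Linear(20, 5),   # last layer: task-specific, usually retrained
)

# Freeze the first linear layer: its weights will not be updated during training
for param in model[0].parameters():
    param.requires_grad = False

# Only the remaining trainable parameters are given to the optimizer
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))  # 2: weight and bias of the last linear layer
```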